Abstract

Circuit topology refers to the arrangement of interactions between objects belonging to a linearly
ordered object set. Linearly ordered sets of objects are common in nature and occur in a wide range
of applications in economics, computer science, social science and chemical synthesis. Examples
include linear biopolymers, linear signaling pathways in cells, as well as topological sorts appearing
in project management. Using a statistical mechanical treatment, we study circuit topology
landscapes of linear polymer chains with intra-chain contacts as a prototype of linearly sorted
objects with interactions. We find generic features of the topological space and study the statistical
properties of the space under the most basic constraints on the occupancy of arrangements and
topological interactions. We observe that a set of correlated contact sites (a sector) could nontrivially
influence the entropy of circuits as the number of involved sites increases. Finally, we discuss
how constraints can be inferred from the information provided by local contact distributions in the
presence of a sector.

I. INTRODUCTION


Molecular chains with intra-molecular contacts are topologically diverse. This is true
even in the absence of branching and knot formation and is rooted in the variety of contact
arrangements available to the chain (Fig. 1, left). Circuit topology formalizes this notion
using discrete mathematics and sets the stage for classification as well as functional and
evolutionary analysis of (bio-)molecular chains based on their structural topology [1]. Two
chains with the same circuit topology may differ in total length and in the length of inter-contact
segments but have an equal number of contacts with identical contact arrangements.


Circuit topology of a chain is a determinant of its function and dynamics and is in
turn determined by the physico-chemical properties of the chain and its environment.
For example, the folding rate of an isolated chain correlates with the number of contact pairs
in parallel arrangement [2]. The topology influences whether distinct intermediate states
are visited during folding and unfolding; here crossed contacts are found to be the key
determinants. On the other hand, intrinsic and extrinsic factors determine the topological
diversity of biomolecules. It is well established that the distribution of positively charged
residues is a key determinant of membrane protein topology. When folding of a chain
occurs while the chain is sequentially synthesized, certain topologies might be kinetically
populated. Evolutionary constraints also select for certain topologies with desired stability
and functionality. Slow-folding chains are prone to unwanted interactions and aggregation
and thus are disfavored. Constraints such as those discussed above are not represented
in the free energy landscape of the chain, which maps conformations of isolated chains to
their corresponding free energies.


Chain models with contacts have served as prototypes in theories of biomolecular chains,
in particular RNA and protein folding, and can be used to study circuit topology. Because
the length of the chain is irrelevant in a topological treatment, the model can be further
simplified by considering only the contacts and setting the length of every chain segment
to unity. Contacts can be displayed as links between the contact sites. The chain will then
be modeled by a connected graph in which the nodes correspond to the contact sites, the
ordered sequence of contact sites corresponds to the polymer backbone and the remaining
links represent the intra-chain contacts. The latter form a perfect matching of the graph.
This graph representation of the chain shrinks the conformational space of the chain to a
topological space.


Studying the topological diversity is technically challenging. The problem of sampling
from the exponentially large space of contact configurations (perfect matchings) could be
very time consuming for disordered and frustrated energy functions. An efficient way of
sampling from such energy landscapes in sparsely (weakly) interacting systems is provided
by the cavity method of statistical physics, relying on the Bethe approximation [5]-[7]. The
recursive and local nature of these equations is exploited in approximate message-passing
algorithms that have proven useful in the study of random constraint satisfaction and
optimization problems [13].


In this article, we discuss the statistical mechanical properties of a single chain that
forms intra-chain binary contacts in the context of circuit topology. We use the Bethe
approximation to characterize the space of contact (link) configurations assuming an energy
function of two-link interactions depending on their relative position in the polymer chain.
We illustrate the constraints imposed by the energy function on the configuration space, and
obtain the one-link and two-link probability distributions to see how the links are organized
by changing the relevant parameters in the energy function. We also obtain the entropy
(logarithm of the number of contact configurations) in terms of the two-link densities in the
energy function of the system. We will see how a subset of correlated contact sites identified
by a sector affects the statistical properties of the chain. In particular, for specific
frustrating energy functions the entropy displays a maximum for sector sizes close to half
the number of contact sites. Finally, given the one-link and two-link data from structured
link configurations, we try to reconstruct the energy function that statistically describes the
observed data. Using this information, we can recover the contact sites of regular sectors of
different sizes with an accuracy that approaches one as the number of sector sites increases.


II. DEFINITIONS 

A chain with $M$ contacts is represented by a graph containing $M$ links with endpoints
$e_l = (i_l, j_l)$ labeled by $l = 1,\dots,M$ with $i_l, j_l \in \{1,\dots,2M\}$. The chain is directed from
left to right, $C = \{1,2,\dots,2M-1,2M\}$. A link configuration is defined by
$L = \{(i_l, j_l) \,|\, l = 1,\dots,M\}$ where $i_l \neq j_l$ and links are not directed, $(i_l, j_l) = (j_l, i_l)$.
To any pair of links $(l, l')$ one of three states may be assigned with respect to the backbone
chain $C$: parallel (p), series (s), or cross (x), see Fig. 1. We will study topologically different
link configurations represented by an $M \times M$ matrix $A$ with elements $A_{l,l'} \in \{p,s,x\}$. Note that
physical length is not relevant to our study and thus we do not care whether the simplified model
with the presented length profile has a physical 3D realization or not.

FIG. 1. (top-left) Arrangements of two links $l, l'$ in series (s), parallel (p), and cross (x) states. A
link can be represented by its endpoints or by its first endpoint and length $(e, r)$. Distance $d$
of two links is the separation of their first endpoints. (top-right) Links labeled according to their
first endpoints to show the possible valid configurations for link $l = 6$ given the links $l' = 1,\dots,5$.
(bottom) A sample link configuration in the presence of a regular hard sector in the middle of the
chain.

Consider the perfect matchings of the $2M$ nodes $i = 1,\dots,2M$ on chain $C$, where a
perfect matching configuration is defined by $\mathcal{M} = \{c_{ij} \in \{0,1\} \,|\, \sum_j c_{ij} = 1, \forall i\}$. Each
perfect matching represents a class of topologically equivalent link configurations related by
a permutation of the link labels. Note that in a perfect matching labels are assigned only to
the endpoints whereas in a link configuration both the endpoints and the links are labeled.
The number of perfect matchings is $(2M-1)!! = (2M-1) \times (2M-3) \times \cdots \times 1$ and
for each one there are $M!$ ways of labeling the links. In other words there are $M!$ matrices
$A$ representing the set of topologically equivalent link configurations.
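As an illustration of these definitions, the following Python sketch enumerates the perfect matchings of a small chain and classifies pairs of links as parallel, series or cross. The helper names (`pair_state`, `perfect_matchings`, `state_counts`) are hypothetical and introduced here only for illustration; they are not taken from the original work.

```python
from itertools import combinations


def pair_state(a, b):
    """Return 'p', 's' or 'x' for two disjoint links given as site pairs."""
    (i1, j1), (i2, j2) = sorted((tuple(sorted(a)), tuple(sorted(b))))
    if j1 < i2:      # first link ends before the second one starts
        return "s"
    if j2 < j1:      # second link is nested inside the first one
        return "p"
    return "x"       # the two links interleave


def perfect_matchings(sites):
    """Yield every perfect matching of the given list of sites."""
    if not sites:
        yield []
        return
    first, rest = sites[0], sites[1:]
    for k, partner in enumerate(rest):
        for tail in perfect_matchings(rest[:k] + rest[k + 1:]):
            yield [(first, partner)] + tail


def state_counts(links):
    """Count the link pairs of each type in one link configuration."""
    counts = {"p": 0, "s": 0, "x": 0}
    for a, b in combinations(links, 2):
        counts[pair_state(a, b)] += 1
    return counts
```

For $M = 3$ the generator yields $(2M-1)!! = 15$ matchings, and the three counts returned by `state_counts` always sum to $M(M-1)/2$.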

A link configuration of $M$ links is composed of $N = M(M-1)/2$ link pairs that can be
classified into three disjoint subsets of size $N_p, N_s, N_x$ depending on their states p, s, x. The
links can have an arbitrary labeling; one special case is to order the links from left to right
according to their first endpoints. At some point in this paper, we will consider structured
link configurations with sectors; a sector is identified by an arbitrary set of endpoints that
remain (possibly differently) connected in different link configurations; we say a sector is
hard if any connection between the sector sites and the other sites is forbidden. Figure 1
illustrates the definitions and notations used throughout this paper.


III. CHARACTERIZING THE CONFIGURATION MATRIX 

Given a perfect matching and a link labeling, one can easily construct the unique matrix
$A$; the endpoints $(i_l, j_l)$ and $(i_{l'}, j_{l'})$ are enough to identify the element $A_{l,l'}$. On the other
hand, given a matrix $A$ and a labeling, one can find the unique matching configuration
corresponding to the matrix (if one exists) by solving the following constraint satisfaction
problem

$$\mathbb{I}(A) = \sum_{\mathbf{e}} \prod_{l<l'} \mathbb{I}(e_l, e_{l'})\, \delta_{A_{l,l'},\, q(e_l, e_{l'})}. \tag{1}$$

In words, we find the endpoints $e_l = (i_l, j_l)$ that make a perfect matching and are consistent
with the matrix $A$. Here $\mathbb{I}(e_l, e_{l'}) = 1$ if $e_l$ and $e_{l'}$ represent two disjoint links with different
endpoints, otherwise it is zero. And $q(e_l, e_{l'}) \in \{p, s, x\}$ returns the state of links $l$ and $l'$ given
their endpoints. The indicator function $\mathbb{I}(A)$ represents the constraints on the valid matrices;
$\mathbb{I}(A) = 1$ if $A$ is a valid matrix, otherwise it is zero. It seems that the constraints on the
matrix elements $A_{l,l'}$ cannot be expressed in a local way; one needs to consider the constraints
imposed on any two matrix elements, three elements, and more. From a computational point
of view, the problem of deciding on the validity of an arbitrary configuration matrix could
be hard, but we will see that at least for a class of ordered matrices this problem is easy.

Suppose the links are ordered according to their first endpoints, that is $i_{l'} < i_l$ if $l' < l$.
We assume that $i_l < j_l$ for any link $l$. Given matrix $A$, we add the links $1, 2, 3, \dots$ one by one
(see Fig. 1, right) to find the matching configuration. In step $l$ we add link $l$ and determine
the link configuration just by looking at the matrix elements $A_{l,l'}$ for $l' < l$. We have to
determine the relative position of the endpoints $(i_l, j_l)$ with respect to the $\{(i_{l'}, j_{l'}) \,|\, l' < l\}$.
Clearly $i_l > i_{l'}$ for all $l' < l$ and we need to consider only the other endpoints $j_{l'}$. The
relative positions of these endpoints have already been determined in the previous steps of
the process, say $j_1 < j_2 < \cdots < j_{l-1}$. Then we group the previous links according to the matrix
elements: $g_p = \{l' < l \,|\, A_{l,l'} = p\}$, with $g_s$ and $g_x$ defined in a similar way. Now it is clear that

$$\begin{aligned}
i_l < j_{l'},\quad j_l < j_{l'} &\quad \text{if } l' \in g_p, \\
i_l > j_{l'},\quad j_l > j_{l'} &\quad \text{if } l' \in g_s, \\
i_l < j_{l'},\quad j_l > j_{l'} &\quad \text{if } l' \in g_x.
\end{aligned} \tag{2}$$

This defines the relative position of link $l$ given links $l' < l$.

The previous paragraph in effect defines the constraints between the matrix elements $A_{l,l'}$.
As before suppose the links are ordered according to their first endpoint, and we are to add
link $l$ given the configuration of links $1, 2, \dots, l-1$. This means that in matrix $A$ we are at
row $l$ and we want to specify the valid matrix configurations in that row, $\{A_{l,l'} \,|\, l' < l\}$. The
above arguments say that these elements are constrained to the following configurations:

$$j_1 \cdots j_a \;\; i_l \;\; j_{a+1} \cdots j_b \;\; j_l \;\; j_{b+1} \cdots j_{l-1} \tag{3}$$

The endpoints $j_{l'}$ that happen before $i_l$ belong to the links of group $g_s$, those that happen
after $j_l$ belong to the links of group $g_p$, and the middle ones belong to the links of group $g_x$.
In short, the matrix elements in rows $1, 2, \dots, l-1$ define the set of possible matrix elements
in row $l$.
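The insertion procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `matching_from_ordered_matrix` and the 0-based, lower-triangular layout of `A` are choices made here, and the explicit check that the new first endpoint lies to the right of all earlier first endpoints is our reading of the ordering assumption.

```python
RANK = {"s": 0, "x": 1, "p": 2}


def matching_from_ordered_matrix(A):
    """Rebuild the endpoints from an ordered configuration matrix.

    A[l][k] in {'p', 's', 'x'} is the state of links l and k for k < l
    (0-based labels, links ordered by first endpoint). Returns a list of
    (i, j) site pairs with sites numbered 1..2M, or None if A is invalid.
    """
    order = []  # endpoint tokens ('i', l) / ('j', l), left to right
    for l in range(len(A)):
        # second endpoints of earlier links, in their current chain order
        js = [(pos, tok[1]) for pos, tok in enumerate(order) if tok[0] == "j"]
        states = [A[l][k] for _, k in js]
        # Eq. (3): read left to right, the states must be s...s x...x p...p
        if any(RANK[a] > RANK[b] for a, b in zip(states, states[1:])):
            return None
        n = len(order)
        pos_i = next((pos for pos, k in js if A[l][k] != "s"), n)
        pos_j = next((pos for pos, k in js if A[l][k] == "p"), n)
        # the new first endpoint must lie to the right of all earlier ones
        if l > 0 and order.index(("i", l - 1)) >= pos_i:
            return None
        order.insert(pos_i, ("i", l))
        order.insert(pos_j + 1, ("j", l))  # +1: i_l was inserted to its left
    site = {tok: pos + 1 for pos, tok in enumerate(order)}
    return [(site[("i", l)], site[("j", l)]) for l in range(len(A))]
```

Each row is processed with a single pass over the endpoints placed so far, which reflects the statement that deciding the validity of an ordered matrix is easy.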

IV. CHARACTERIZING THE MATCHING CONFIGURATIONS 

Consider matching configurations with a given number $N_p, N_s, N_x$ of link pairs $(l, l')$ of
type p, s, x, respectively. We take $N = N_p + N_s + N_x = M(M-1)/2$ and define the energy
function $E(n_p, n_s, n_x) = -M \ln M\,(\lambda_p n_p + \lambda_s n_s + \lambda_x n_x)$ with densities $n_{p,s,x} = N_{p,s,x}/N$. The
factor $M \ln M$ is chosen to have the same scaling for the energy and the leading term of the
entropy function $S(n_p, n_s, n_x)$; we recall that the total number of link configurations scales as
$e^{M \ln M}$ for large $M$. Figure 2 shows the exact entropy we obtain for a small number of links;
the entropy goes to zero for the all-(p, s, x) configurations in the corners of the entropy plot,
takes larger values when two types of contacts are allowed, and finally takes its maximum
value for the neutral choice of the energy parameters $\lambda_{p,s,x} = 0$, where $n_{p,s,x} = 1/3$. In the
figure we also see how the energy parameters control the two-link densities $n_{p,s,x}$; increasing
$\lambda_{p,s,x}$ increases the probability of having a configuration with more of the corresponding type
of contact. Note that the entropy distribution is broader in the $n_s$ direction and approaches
a nonzero value for $n_s \to 0$. We observe the same behaviors for larger numbers of links using
the following approximate algorithm.

FIG. 2. The entropy $S = \ln \mathcal{N} / (M \ln M)$ vs the two-link densities $n_{p,s} = N_{p,s}/N$, and the two-link
densities vs the energy parameters $\lambda_{p,s}$ for $\lambda_x = 0$, obtained by an exhaustive enumeration
algorithm for $M = 9$ links. $\mathcal{N}$ is the number of configurations.
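For small $M$ the exact entropy can be tabulated by brute force. The sketch below is an assumed minimal version of such an exhaustive enumeration (the function name `entropy_table` and its return format are hypothetical): it counts the perfect matchings that share the same triple $(N_p, N_s, N_x)$ and normalizes the logarithm of each count by $M \ln M$.

```python
import math
from collections import Counter
from itertools import combinations


def pair_state(a, b):
    """Return 'p', 's' or 'x' for two disjoint links given as site pairs."""
    (i1, j1), (i2, j2) = sorted((tuple(sorted(a)), tuple(sorted(b))))
    if j1 < i2:
        return "s"
    return "p" if j2 < j1 else "x"


def perfect_matchings(sites):
    """Yield every perfect matching of the given list of sites."""
    if not sites:
        yield []
        return
    first, rest = sites[0], sites[1:]
    for k, partner in enumerate(rest):
        for tail in perfect_matchings(rest[:k] + rest[k + 1:]):
            yield [(first, partner)] + tail


def entropy_table(M):
    """Map (N_p, N_s, N_x) to (number of matchings, ln(number) / (M ln M))."""
    counts = Counter()
    for matching in perfect_matchings(list(range(1, 2 * M + 1))):
        c = Counter(pair_state(a, b) for a, b in combinations(matching, 2))
        counts[(c["p"], c["s"], c["x"])] += 1
    norm = M * math.log(M)  # requires M >= 2
    return {key: (num, math.log(num) / norm) for key, num in counts.items()}
```

The cost grows as $(2M-1)!!$, so this approach is only practical for a handful of links, which is why an approximate method is needed for larger systems.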

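Let us represent a perfect matching by the set of endpoints $\mathbf{e} = \{e_l = (i_l, j_l) \,|\, l = 1,\dots,M\}$
assigned to the $M$ link variables, such that $e_l \neq e_{l'}$ for any two different links. Then, we
consider the following probability measure in the space of link configurations

$$\mu(\mathbf{e}) \propto \prod_{l<l'} \left( \mathbb{I}(e_l, e_{l'})\, e^{\tilde{\lambda}_{q(e_l, e_{l'})}} \right), \tag{4}$$

with $\tilde{\lambda}_{p,s,x} = \frac{2 \ln M}{M-1} \lambda_{p,s,x}$.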

We will compute the local marginals $\mu_l(e_l), \dots$ of $\mu(\mathbf{e})$ within the Bethe
approximation. Moreover, we assume the relevant link configurations are organized in a
simple and connected region of the configuration space; this is called the replica symmetric
approximation. To this end, we need to write the recursive equations for the cavity
marginals $\mu_{l \to l'}(e_l)$ of having endpoints $e_l$ for link $l$ in the absence of link $l'$,

$$\mu_{l \to l'}(e_l) \propto \prod_{l'' \neq l, l'} \left( \sum_{e_{l''}} \mathbb{I}(e_l, e_{l''})\, e^{\tilde{\lambda}_{q(e_l, e_{l''})}}\, \mu_{l'' \to l}(e_{l''}) \right). \tag{5}$$

These are the so-called belief propagation (BP) equations. Note that for large $M$
some of the cavity marginals $\mu_{l \to l'}(e_l)$ could take very small values, increasing the numerical
errors. To get around this problem, one can instead work with the cavity fields
$h_{l \to l'}(e_l) = \ln[\mu_{l \to l'}(e_l) / \mu_{l \to l'}(e_0)]$, defined with respect to a reference link variable $e_0$.


The above equations can be solved by iteration starting from random initial messages.
The fixed-point cavity marginals are enough to compute the interesting average quantities
$\langle n_{p,s,x} \rangle$, and the entropy $S(n_p, n_s, n_x)$, as described in more detail in Appendix A. In addition,
we explain another efficient representation of the problem working with the matching
variables $c_{ij} \in \{0,1\}$. In Appendix C we give the parameter values for which the BP algorithm
converges; the maximum entropy region around $\lambda_{p,s,x} = 0$ is surrounded by a region
in which the algorithm does not converge, but the entropy still has a significant value. This
could happen if strong correlations impose a more complicated organization of the relevant
link configurations in the configuration space.
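A compact way to see how such an iteration works is the following NumPy sketch of belief propagation for the pairwise measure of Eq. (4) on a small chain. It is an assumption-laden illustration rather than the authors' code: the function names, the damping factor, the dense storage of all $M(M-1)$ messages, and the uniform initialization used when `seed` is `None` are all choices made here.

```python
import numpy as np
from itertools import combinations


def pair_state_index(a, b):
    """0, 1, 2 for parallel, series, cross; -1 if the links share a site."""
    if set(a) & set(b):
        return -1
    (i1, j1), (i2, j2) = sorted((a, b))
    if j1 < i2:
        return 1
    return 0 if j2 < j1 else 2


def bp_one_link_marginals(M, lam, iters=200, damping=0.5, seed=None):
    """lam = (lambda_p, lambda_s, lambda_x); requires M >= 2."""
    states = list(combinations(range(1, 2 * M + 1), 2))  # all (i, j), i < j
    K = len(states)
    lam_t = 2.0 * np.log(M) / (M - 1) * np.asarray(lam, dtype=float)
    W = np.zeros((K, K))  # symmetric pair weights of Eq. (4)
    for a, ea in enumerate(states):
        for b, eb in enumerate(states):
            q = pair_state_index(ea, eb)
            if q >= 0:
                W[a, b] = np.exp(lam_t[q])
    if seed is None:
        mu = np.full((M, M, K), 1.0 / K)  # mu[l, k] is the message l -> k
    else:
        mu = np.random.default_rng(seed).random((M, M, K))
        mu /= mu.sum(axis=2, keepdims=True)
    diag = np.eye(M, dtype=bool)

    def log_incoming(msg):
        out = np.log(msg @ W)  # out[k, l, a] = log sum_b W[a, b] msg[k, l, b]
        out[diag] = 0.0        # a link sends no message to itself
        return out

    def normalize(log_p):
        p = np.exp(log_p - log_p.max(axis=-1, keepdims=True))
        return p / p.sum(axis=-1, keepdims=True)

    for _ in range(iters):
        inc = log_incoming(mu)
        total = inc.sum(axis=0)  # total[l, a]: all messages into link l
        # cavity update: remove the message coming from the target link
        new = normalize(total[:, None, :] - np.swapaxes(inc, 0, 1))
        mu = damping * mu + (1.0 - damping) * new
    return states, normalize(log_incoming(mu).sum(axis=0))
```

The messages are combined in the log domain, in the same spirit as the cavity fields mentioned above, to limit numerical underflow. With $\lambda_{p,s,x} = 0$ and uniform initial messages the one-link marginal stays uniform over the $M(2M-1)$ possible links.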


Figures 3 and 4 display some typical one-link and two-link distributions obtained by the
Bethe approximation. We see that the structural properties of the link configurations change
considerably around the origin of the parameter space $\lambda_{p,s,x} = 0$, where the one-link probability
distribution $\mu_l(e, r)$ is uniform. In particular, the one-link distributions in Fig. 3 show that
for $(\lambda_p = 0, \lambda_{s,x} = 1)$ we have short links that are mostly concentrated at the beginning and
at the end of the chain in two communities. The other cases shown in the figure correspond
to simpler structures dominated by one type p, s, x, or a superposition of two types. The
two-link distance distributions in Fig. 4 show that, as expected, there is always a
nonzero length scale $d^*$ separating two links that are in series. However, starting from the
neutral parameter values $\lambda_{p,s,x} = 0$, the distance $d^*$ behaves differently on increasing the
number of parallel or crossing two-links. In Appendix C we give the link distributions for
more instances of the parameters along with a comparison of the approximate and exact
data for a small number of links.






FIG. 3. One-link distribution (more precisely $M(2M-1)\mu_l(e, r)$) obtained by the Bethe
approximation for $M = 50$ links and different energy parameters $\lambda_{p,s,x}$. Here $\mu_l(e, r)$ is the probability
of having a link with the first endpoint $e$ and length $r$. The one-link distribution is uniform and
thus trivial for $\lambda_{p,s,x} = 0$ (not shown here). (top-left) Giving more weight to parallel two-links
results in long links with first endpoints concentrated in the beginning and the first half of the
chain. (top-middle) Giving more weight to series two-links results in short links with first endpoints
nearly uniformly distributed along the chain. (top-right) Giving more weight to cross two-links
results in links that are anywhere, but are of a particular length. (bottom-left) Giving less weight
to cross two-links leads to a mixture of short and long links distributed uniformly along the chain.
(bottom-middle) Giving less weight to series two-links leads to links with a broad range of lengths,
starting mostly at the beginning of the chain. (bottom-right) Giving less weight to parallel
two-links leads to mostly short links concentrated in the beginning and end of the chain.


V. SECTORS: IMPLICATIONS AND INFERENCE 

Let us consider a hard sector where any link connects either two sites inside or two sites outside
the sector set. The sector is defined by an arbitrary subset of contact sites $S = \{i_1, \dots, i_{|S|}\}$
that could be distributed randomly or regularly, and energy parameters
$(\lambda^{S}_{p,s,x}, \lambda^{S\bar{S}}_{p,s,x}, \lambda^{\bar{S}}_{p,s,x})$.
These parameters specify the relative importance of two-link arrangements for two links in
the sector ($\lambda^{S}_{p,s,x}$), one link in the sector and the other not in the sector ($\lambda^{S\bar{S}}_{p,s,x}$), and two links
not in the sector ($\lambda^{\bar{S}}_{p,s,x}$). Using this information, we can find an estimate of the entropy
and other statistical properties of the chain by the BP equations given above; but now an
endpoint inside the sector can be connected only to another one in the same sector. In this
section, we mainly focus on the inverse problem of inferring the sector from an appropriate
set of observation data. Obviously, we first need to solve the forward problem of computing
the expectation values of the relevant quantities, as explained in the previous section.

FIG. 4. Two-link distance distribution $\mu_{ll'}(d)$ (multiplied by a constant to make it of order one)
for different energy parameters $\lambda_{p,s,x}$ and different types of two-links (p, s, x), obtained by the
Bethe approximation for $M = 50$ links. Here $\mu_{ll',q}(d)$ is the probability of finding two links of type
$q = p, s, x$ at distance $d$ (separation of the first endpoints) from each other.

Figure 5 shows the number (entropy) of link configurations and two-link densities $n_{p,s,x}$
vs the number of randomly selected sector sites (sector size $|S|$) for fixed energy parameters.
Here, we are giving more weight to parallel two-links inside the sector and vary the other
energy parameters. The entropy is, of course, larger for very small or large sector sizes
than for intermediate sizes. However, depending on the energy parameters, the entropy
could display a local maximum for sector sizes around L/2. In the same region, we observe
convergence problems in the BP algorithm, signaling the presence of strongly correlated link
variables. Note that the number of forbidden link configurations increases with the size of the
sector. On the other hand, when the energy parameters
$(\lambda^{S}_{p,s,x}, \lambda^{S\bar{S}}_{p,s,x}, \lambda^{\bar{S}}_{p,s,x})$ are different,
new link configurations could appear as the sector size increases. The local maximum in the
entropy is observed if the number of new configurations dominates the number of forbidden
ones. When this happens, as the figure shows, the differences in the two-link densities
become smaller, making the system closer to the absolute maximum entropy point, where
$n_{p,s,x} = 1/3$.

The energy function we considered in the above sections was indeed devised to explore
the configuration space for different densities $n_{p,s,x}$. To have more control on the structure
of the link configurations we have to consider more general energy functions. Suppose we
are given the average numbers $M^*(r)$ of links of length $r$, and numbers $N^*_{p,s,x}(d)$ of link pairs
of type p, s, x with distance $d$ between their first endpoints. From the maximum entropy
principle [20], the right energy function to model the system is

$$E = -\sum_r \lambda(r) \left( \sum_l \delta_{r_l, r} \right) - \sum_{q=p,s,x} \sum_d \lambda_q(d) \left( \sum_{l<l'} \delta_{q_{ll'}, q}\, \delta_{d_{ll'}, d} \right), \tag{6}$$

where $r_l$ is the length of link $l$, and $q_{ll'}$ and $d_{ll'}$ are the state and the distance of links $l$ and $l'$.

FIG. 5. The entropy $S$ and link densities $n_{p,s,x}$ vs the sector size obtained by the Bethe
approximation for fixed energy parameters $(\lambda^{S}_{p,s,x}, \lambda^{S\bar{S}}_{p,s,x}, \lambda^{\bar{S}}_{p,s,x})$ with $M = 40$ links. Here $S$ stands
for two links in sector, $\bar{S}$ for two links not in sector, and $S\bar{S}$ for one link in sector and the other
not in sector. To shorten the notation, we use for example (p, x, s) to show $(\lambda^{S}_{p} = 1, \lambda^{S}_{s,x} = 0)$,
$(\lambda^{S\bar{S}}_{x} = 1, \lambda^{S\bar{S}}_{p,s} = 0)$, and $(\lambda^{\bar{S}}_{s} = 1, \lambda^{\bar{S}}_{p,x} = 0)$, highlighting only the nonzero parameters. The data
are averaged over 100 realizations of randomly selected sector sites with relative error bars of order
0.01.


Given the parameters $\lambda(r)$ and $\lambda_{p,s,x}(d)$, we can use the Bethe approximation as before to
compute the averages $\langle M(r) \rangle$ and $\langle N_{p,s,x}(d) \rangle$. In the inverse problem, we are given the
average numbers, and look for the energy parameters describing the data [17]-[19].


In practice, we solve the inverse problem by iteration [21], [22]: Starting from an initial
set of parameters we compute the above averages within the Bethe approximation. We then
apply incremental changes to the parameters according to deviations of the average values
from their target values $M^*(r)$ and $N^*_{p,s,x}(d)$,

$$\delta\lambda(r) = \eta\,[M^*(r) - \langle M(r) \rangle], \tag{7}$$

$$\delta\lambda_{p,s,x}(d) = \eta\,[N^*_{p,s,x}(d) - \langle N_{p,s,x}(d) \rangle], \tag{8}$$

with $0 < \eta \ll 1$. Figure 6 shows the one-link probability distributions in the reconstructed
model obtained, using this protocol, for a system of link variables in the presence of a
hard sector. Here the sector consists of L/2 sites in the middle of the chain, and the data
come from randomly generated link configurations respecting the constraints imposed by
the sector (Fig. 1 shows one sample configuration). As the figure shows, we can observe the
signature of the sector already in the one-link distribution $\mu_l(e, r)$. Finally, we can repeat
the above procedure to find better models, but this time we add an external field to disfavor
the less probable connections suggested by $\mu_l(e, r)$ in the previous stage (see the Appendix).
In principle, to infer the sector sites we need to study the likelihood of the model, which is
proportional to $\Pr(\sigma)\Pr(\mathcal{D}|\lambda, \sigma)$, where $\sigma$ defines the position of the sector sites, and $\Pr(\sigma)$ gives the
prior probability of having $\sigma$. $\Pr(\mathcal{D}|\lambda, \sigma)$ is the probability of observing the data $\mathcal{D}$ given
the model parameters $\lambda$ and the sector $\sigma$. Here, for simplicity, we try a naive two-stage
strategy using the reconstructed one-link probability distribution.
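The update rule of Eqs. (7) and (8) can be tried out on a toy version of the problem. In the sketch below, which is an illustration under explicit assumptions rather than the procedure used for Fig. 6, the averages are computed by exact enumeration instead of the Bethe approximation, the model has only the three global parameters $\lambda_{p,s,x}$ conjugate to the pair counts $N_{p,s,x}$ (without the $M \ln M$ scaling), and the names `mean_counts` and `fit_parameters` are hypothetical.

```python
import math
from itertools import combinations


def pair_state(a, b):
    """Return 'p', 's' or 'x' for two disjoint links given as site pairs."""
    (i1, j1), (i2, j2) = sorted((tuple(sorted(a)), tuple(sorted(b))))
    if j1 < i2:
        return "s"
    return "p" if j2 < j1 else "x"


def perfect_matchings(sites):
    """Yield every perfect matching of the given list of sites."""
    if not sites:
        yield []
        return
    first, rest = sites[0], sites[1:]
    for k, partner in enumerate(rest):
        for tail in perfect_matchings(rest[:k] + rest[k + 1:]):
            yield [(first, partner)] + tail


def mean_counts(M, lam):
    """Exact averages <N_q> under the weights exp(sum_q lam[q] * N_q)."""
    z = 0.0
    acc = {"p": 0.0, "s": 0.0, "x": 0.0}
    for matching in perfect_matchings(list(range(1, 2 * M + 1))):
        n = {"p": 0, "s": 0, "x": 0}
        for a, b in combinations(matching, 2):
            n[pair_state(a, b)] += 1
        w = math.exp(sum(lam[q] * n[q] for q in n))
        z += w
        for q in n:
            acc[q] += w * n[q]
    return {q: acc[q] / z for q in acc}


def fit_parameters(M, target, eta=0.2, steps=3000):
    """Move each parameter along (target average - model average)."""
    lam = {"p": 0.0, "s": 0.0, "x": 0.0}
    for _ in range(steps):
        mean = mean_counts(M, lam)
        for q in lam:
            lam[q] += eta * (target[q] - mean[q])
    return lam
```

Because $N_p + N_s + N_x$ is fixed, the parameters are determined only up to a common shift; the update preserves their sum, so starting from zero selects the solution with $\lambda_p + \lambda_s + \lambda_x = 0$.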

More specifically, given the $\mu_l(e, r)$, we infer the sector contact sites by maximizing the
probability

$$P(\sigma) \propto \prod_{i<j} a_{ij}^{\delta_{\sigma_i, \sigma_j}}\, [1 - a_{ij}]^{1 - \delta_{\sigma_i, \sigma_j}}. \tag{9}$$

Here $(i = e, j = e + r)$, and $a_{ij} = M\mu_l(i, j)(1 - \dots)$ is the connection probability.
Moreover, $\sigma_i = 1$ if site $i$ belongs to the sector, otherwise $\sigma_i = 0$. To find the $\sigma$ maximizing
the above probability, we make use of the Bethe approximation to compute the marginals
of the joint probability distribution proportional to $P(\sigma)^{\beta}$, where $\beta$ is an inverse temperature
parameter. Then, we take the limit $\beta \to \infty$ of the finite-temperature BP equations to
obtain the so-called minsum equations, by assuming an appropriate scaling for the BP
marginals (see Appendix B),

$$h_{i \to j} = \sum_{k \neq i,j} \max\big(\ln(1 - a_{ik}),\, \ln a_{ik} + h_{k \to i}\big) - \sum_{k \neq i,j} \max\big(\ln a_{ik},\, \ln(1 - a_{ik}) + h_{k \to i}\big). \tag{10}$$

We solve the equations for the cavity messages $h_{i \to j}$ by iteration, and compute the local
messages $h_i$ considering all the incoming messages from the neighboring variables. These
messages are used in a decimation or reinforcement algorithm to find a good representative
of the minimum energy configuration. In this way, we could infer the sector sites in
a system of $M = 20$ links with an accuracy, (# true positive + # true negative)/(total
number of sites), that approaches 1 for regular sectors of size L/2, see Fig. 6. The accuracy
is of course smaller for smaller (also for random) sectors, where one needs more accurate
inference algorithms considering the whole likelihood of the model to capture the subtle
sector information in the data.

FIG. 6. (left) The one-link probability distribution $\mu_l(e, r)$ obtained by the inverse algorithm given
the average numbers $M^*(r), N^*_{p,s,x}(d)$ extracted from 10000 randomly generated configurations of
$M = 20$ links in the presence of a hard sector of size L/2 in the center of the chain. Here $(e, r)$ gives
the first endpoint $e$ and length $r$ of the link. We note the presence of short links inside and outside
the sector in addition to a subset of long links leaping over the sector. (right) Accuracy, (# true
positive + # true negative)/(total number of sites), of the inverse algorithm using the reconstructed
one-link distribution $\mu_l(e, r)$ to infer regular sectors of different sizes in the center of the chain.
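The minsum iteration of Eq. (10) can be sketched with NumPy as follows. Several ingredients are assumptions made for this illustration and are not taken from the original work: the function name `minsum_sector`, the external field that pins one site into the sector (needed here because $P(\sigma)$ is unchanged when every $\sigma_i$ is flipped), and the final assignment by the sign of the local field $h_i$ in place of a decimation or reinforcement step.

```python
import numpy as np


def minsum_sector(a, pinned=0, field=10.0, iters=50):
    """Min-sum estimate of the sector labels from connection probabilities.

    a[i, k] must be symmetric with off-diagonal entries strictly in (0, 1).
    Returns (sigma, h) with sigma[i] = 1 if the local field h[i] is positive.
    """
    a = np.array(a, dtype=float)
    np.fill_diagonal(a, 0.5)  # diagonal is never used; avoids log(0)
    n = a.shape[0]
    la, l1a = np.log(a), np.log1p(-a)
    ext = np.zeros(n)
    ext[pinned] = field       # assumed symmetry-breaking field
    diag = np.eye(n, dtype=bool)
    h = np.zeros((n, n))      # h[i, j] is the cavity message h_{i -> j}

    def local(msg):
        hin = msg.T           # hin[i, k] = h_{k -> i}
        c = np.maximum(l1a, la + hin) - np.maximum(la, l1a + hin)
        c[diag] = 0.0
        return c, ext + c.sum(axis=1)

    for _ in range(iters):
        c, total = local(h)
        h = total[:, None] - c  # Eq. (10): leave out the term coming from j
        h[diag] = 0.0
    _, total = local(h)
    return (total > 0).astype(int), total
```

For a clean two-block matrix, with large $a_{ij}$ inside each group of sites and small $a_{ij}$ between groups, the fields align within a few sweeps and the group containing the pinned site is labeled as the sector.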


VI. DISCUSSION 


Circuit topology is a molecular property that describes the arrangement of intramolecular
contacts and critically determines the dynamics and function of biomolecules such as proteins
and nucleic acids [1]. Here we explored the space of available circuit topologies by
clarifying fundamental constraints on the configuration matrix $A$. We then studied how one
can construct a link configuration given an ordered configuration matrix. We showed that
the problem of deciding on the validity of such a configuration matrix is computationally
easy.

We used the cavity method of statistical physics to explore the space of link configurations
using an energy function of the two-link densities $n_{p,s,x}$. Exact computation of the entropy
function $S(n_p, n_s)$ for a small number of links shows that configurations with small $n_s$ are
more frequent than those of small $n_p$. Approximate computations for larger numbers of links
show similar behaviors for the entropy function. Moreover, the convergence pattern of the
BP algorithm suggests that link variables are more correlated in the region of small $n_p$ than
for small $n_s$ or $n_x$. This is close to the region where typical link configurations exhibit some
degree of modularity, as observed for $\lambda_p = 0, \lambda_{s,x} = 1$ in Fig. 3.

The analysis of the one-link and two-link probability distributions showed that the structure
of link configurations changes considerably on varying the energy parameters $\lambda_{p,s,x}$
(conjugate to densities $n_{p,s,x}$); in particular, the typical configurations for $n_{s,x} > n_p$ exhibit a
modular structure with components concentrated on the two sides of the chain. For the
neutral choice of the energy parameters $\lambda_{p,s,x} = 0$, two parallel or cross links are mostly
found close to each other, whereas two series links are separated by a nonzero characteristic
length scale $d^*$. The distance $d^*$ decreases on increasing $n_p$, where two parallel links are
typically found at distance $d^*$. Similarly, increasing $n_x$ defines a nonzero length scale $d^*$ but
this time $d^*$ increases.

The constraints imposed by sectors of different sizes affect the statistical properties of the
system in an unexpected way; in fact, for fixed energy parameters $\lambda_{p,s,x}$, the entropy could
display a local maximum for an intermediate value of the sector size. Finally, we used the
one- and two-link statistics in a learning algorithm to reconstruct the energy function that
statistically describes the typical link configurations in the presence of sectors. The information
contained in the reconstructed model enables us to infer a meaningful number of the sector
sites by a naive two-stage algorithm. In particular, we can successfully infer regular sectors
of size L/2 in the center of the chain.

The simplicity of the model system used in this study serves to disentangle the role of
topology from other structural features such as size, steric constraints and chemical structure
of the polymer building blocks. In realistic settings, considering non-topological properties
is often inevitable. We note that the energy constraint used in this study can be tailor-made
for different molecular systems based on experimental data. Our aim in this article was to
provide a proof-of-concept study of applying statistical mechanics to molecular topology.

VII. CONCLUSION


Structural topology of a linearly sorted multi-component system is often a critical determinant
of its function and dynamics; thus controlling the topology of the system is of prime
importance. In biomolecular systems, native topology reflects the evolutionary constraints
imposed on the systems, while in engineering systems one needs to constrain the space
of available topologies to ensure desired outcomes. For biomolecular systems, the arrangement
of intramolecular contacts is the most relevant topological feature. Here, using a simple
model of a folded linear biomolecule, we explored the space of available contact arrangements
and examined the impact of constraints on the topology of folded linear chains. Our approach
enables identification of the underlying structural design principles and inference of
associated evolutionary constraints and, as such, it potentially helps in understanding the
functioning and the evolution of these systems. Further, it may inspire engineers to build
molecular systems with new functionalities for technological applications.

Structural topology of a molecular system is often a critical determinant of its function
and dynamics; thus controlling the topology of the system is of prime importance. For
linear biopolymers, the arrangement of intramolecular contacts is the most relevant topological
feature. Here, using a simple model of folded linear biomolecules, we explored, for the first
time, the space of available contact arrangements and examined the impact of constraints
on the topology of folded linear chains. We found that the form of the imposed constraints
critically determines not only the pairwise arrangement of contacts but also the global shape
of the folded polymer. This capability allowed us to systematically search for the emergence
of folded domains and sectors by varying the relevant order parameters. Our finding is
particularly important for molecular engineering, where one needs to constrain the space of
available topologies to ensure desired functional outcomes.

Our approach enables identification of the underlying structural design principles and
inference of associated constraints and, as such, it potentially helps in understanding the
functioning and the evolution of folded biomolecules, ranging from protein and RNA to
chromosomes (e.g. topologically associated domains in chromosomes). In biomolecular
systems, native topology reflects the constraints imposed on the systems during synthesis
as well as those imposed in the course of evolution. We demonstrated how one could infer the
form of the imposed constraints from naturally occurring arrangements of contacts, which can
in turn be readily extracted from coordinate files available in the databases. In the future,
we foresee application of this approach to studies on molecular evolution as well as on in vivo
molecular folding processes, where a number of physical constraints guide the conformational
search of biopolymers to their native states. Further, the capability to infer constraints from
natural systems may inspire engineers to build molecular systems with new functionalities
for technological applications.
Appendix A: Details of the belief propagation equations


We start from the partition function 


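$$ Z(\lambda_p,\lambda_s,\lambda_x) = \frac{1}{M!} \sum_{\mathbf{e}} \prod_{l<l'} e^{\lambda_p\delta_{q(e_l;e_{l'}),p} + \lambda_s\delta_{q(e_l;e_{l'}),s} + \lambda_x\delta_{q(e_l;e_{l'}),x}}, \tag{A1} $$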

which is a weighted sum over the link configurations $\mathbf{e}$ satisfying the perfect matching
constraints. We divided the right-hand side by $M!$ to cancel the overcounting that results
from different link permutations. For large $M$, the partition function can be rewritten as


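$$ Z(\lambda_p,\lambda_s,\lambda_x) = \int dn_p\, dn_s\, dn_x\; e^{(M\ln M)\,[S(n_p,n_s,n_x) + \Lambda_p n_p + \Lambda_s n_s + \Lambda_x n_x]} \simeq e^{(M\ln M)\,\phi(\Lambda_p,\Lambda_s,\Lambda_x)}. \tag{A2} $$

Here $e^{(M\ln M)S(n_p,n_s,n_x)}$ is the number of matchings of given densities $n_{p,s,x} = N_{p,s,x}/N$. Note
that the total number of perfect matchings is $e^{\ln(2M)! - \ln M! - M\ln 2}$, which for large $M$ scales as
$e^{M\ln M - M(1-\ln 2)}$. Moreover, we have $\lambda_{p,s,x} = \frac{M\ln M}{N}\Lambda_{p,s,x}$.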
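The count of perfect matchings quoted above can be checked numerically. The following is a minimal sketch, assuming only the closed form $(2M)!/(M!\,2^M)$ implied by the exponent $\ln(2M)! - \ln M! - M\ln 2$; the function names are illustrative and not taken from the paper.

```python
import math

def num_perfect_matchings(M: int) -> int:
    """Number of ways to pair up 2M ordered sites: (2M)! / (M! 2^M)."""
    return math.factorial(2 * M) // (math.factorial(M) * 2 ** M)

def log_count_exact(M: int) -> float:
    """ln(2M)! - ln M! - M ln 2, using lgamma(n + 1) = ln n!."""
    return math.lgamma(2 * M + 1) - math.lgamma(M + 1) - M * math.log(2)

def log_count_leading(M: int) -> float:
    """Leading large-M behaviour: M ln M - M (1 - ln 2)."""
    return M * math.log(M) - M * (1 - math.log(2))

if __name__ == "__main__":
    for M in (5, 20, 80):
        print(M, log_count_exact(M), log_count_leading(M))
```

The two logarithms differ by a constant that tends to $\frac{1}{2}\ln 2$ (the subleading Stirling term), which is negligible compared with $M\ln M$ for large $M$.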

We will use the Bethe approximation to compute $\phi(\Lambda_p,\Lambda_s,\Lambda_x)$, and then by the Legendre
transformation we will obtain the entropy function,

$$ S(n_p^*,n_s^*,n_x^*) = \phi(\Lambda_p,\Lambda_s,\Lambda_x) - \Lambda_p n_p^* - \Lambda_s n_s^* - \Lambda_x n_x^*. \tag{A3} $$


The values $n^*_{p,s,x}$ are determined by the saddle-point equations,

$$ n_p^* = \frac{\partial\phi}{\partial\Lambda_p} = \frac{1}{N}\langle N_p\rangle, \tag{A4} $$

and similarly for $n^*_{s,x}$.

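The central quantities in the Bethe approximation are the cavity marginals $\mu_{l\to l'}(e_l)$,
giving the probability of having endpoints $e_l$ for link $l$ in the absence of interactions and
constraints involving $e_{l'}$ [15]. The cavity marginal $\mu_{l\to l'}(e_l)$ is obtained by considering the
cavity messages from the other variables, $\mu_{l''\to l}(e_{l''})$, and the local constraints depending on
the $(e_l,e_{l''})$:

$$ \mu_{l\to l'}(e_l) \propto \prod_{l''\neq l,l'} \left( \sum_{e_{l''}\neq e_l} e^{\lambda_p\delta_{q(e_l;e_{l''}),p} + \lambda_s\delta_{q(e_l;e_{l''}),s} + \lambda_x\delta_{q(e_l;e_{l''}),x}}\, \mu_{l''\to l}(e_{l''}) \right). \tag{A5} $$

We solve the equations by iteration, starting from random initial marginals and updating
them in a random sequential way according to the above equations.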
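Having the cavity marginals, the two-link marginals read

$$ \mu_{ll'}(e_l,e_{l'}) \propto \mu_{l\to l'}(e_l)\, e^{\lambda_p\delta_{q(e_l;e_{l'}),p} + \lambda_s\delta_{q(e_l;e_{l'}),s} + \lambda_x\delta_{q(e_l;e_{l'}),x}}\, \mu_{l'\to l}(e_{l'}), \qquad e_l \neq e_{l'}. \tag{A6} $$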


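The free energy in the Bethe approximation is given by $(M\ln M)\phi = \sum_l \Delta\phi_l - \sum_{l<l'} \Delta\phi_{ll'} - \ln M!$,
where $\Delta\phi_l$ and $\Delta\phi_{ll'}$ are the local free energy shifts. These are the changes in the
free energy after adding the constraints and the energy terms depending on $e_l$, and those
that involve $(e_l,e_{l'})$, namely,

$$ e^{\Delta\phi_l} = \sum_{e_l} \prod_{l'\neq l} \left( \sum_{e_{l'}\neq e_l} e^{\lambda_p\delta_{q(e_l;e_{l'}),p} + \lambda_s\delta_{q(e_l;e_{l'}),s} + \lambda_x\delta_{q(e_l;e_{l'}),x}}\, \mu_{l'\to l}(e_{l'}) \right), \tag{A7} $$

$$ e^{\Delta\phi_{ll'}} = \sum_{e_l} \sum_{e_{l'}\neq e_l} \mu_{l\to l'}(e_l)\, e^{\lambda_p\delta_{q(e_l;e_{l'}),p} + \lambda_s\delta_{q(e_l;e_{l'}),s} + \lambda_x\delta_{q(e_l;e_{l'}),x}}\, \mu_{l'\to l}(e_{l'}). \tag{A8} $$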

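Here the links are equivalent, so we rewrite the BP equations as

$$ \mu(e_l) \propto \left( \sum_{e_{l'}\neq e_l} e^{\lambda_p\delta_{q(e_l;e_{l'}),p} + \lambda_s\delta_{q(e_l;e_{l'}),s} + \lambda_x\delta_{q(e_l;e_{l'}),x}}\, \mu(e_{l'}) \right)^{M-2}. \tag{A9} $$

Let us represent $e_l$ by its first endpoint and its length $(e,r)$; then the above equation reads

$$ \mu(e,r) \propto \left[ e^{\lambda_p} w_p(e,r) + e^{\lambda_s} w_s(e,r) + e^{\lambda_x} w_x(e,r) \right]^{M-2}, \tag{A10} $$

where

$$ w_p(e,r) = \sum_{e'=1}^{e-1}\ \sum_{r'=e+r+1-e'}^{2M-e'} \mu(e',r') + \sum_{e'=e+1}^{e+r-2}\ \sum_{r'=1}^{e+r-1-e'} \mu(e',r'), \tag{A11} $$

$$ w_s(e,r) = \sum_{e'=1}^{e-2}\ \sum_{r'=1}^{e-e'-1} \mu(e',r') + \sum_{e'=e+r+1}^{2M-1}\ \sum_{r'=1}^{2M-e'} \mu(e',r'), \tag{A12} $$

$$ w_x(e,r) = \sum_{e'=1}^{e-1}\ \sum_{r'=e-e'+1}^{e+r-1-e'} \mu(e',r') + \sum_{e'=e+1}^{e+r-1}\ \sum_{r'=e+r+1-e'}^{2M-e'} \mu(e',r'). \tag{A13} $$

Similarly, we obtain

$$ e^{\Delta\phi_l} = \sum_{e=1}^{2M-1}\ \sum_{r=1}^{2M-e} \left[ e^{\lambda_p} w_p(e,r) + e^{\lambda_s} w_s(e,r) + e^{\lambda_x} w_x(e,r) \right]^{M-1}, \tag{A14} $$

$$ e^{\Delta\phi_{ll'}} = \sum_{e=1}^{2M-1}\ \sum_{r=1}^{2M-e} \mu(e,r) \left[ e^{\lambda_p} w_p(e,r) + e^{\lambda_s} w_s(e,r) + e^{\lambda_x} w_x(e,r) \right]. \tag{A15} $$

Moreover, we have

$$ \langle n_p \rangle = e^{-\Delta\phi_{ll'}}\, e^{\lambda_p} \sum_{e=1}^{2M-1}\ \sum_{r=1}^{2M-e} \mu(e,r)\, w_p(e,r). \tag{A16} $$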
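The BP update for equivalent links can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in the paper: links are stored as endpoint pairs $(i,j)$ rather than (first endpoint, length), the weights $w_{p,s,x}$ are accumulated by classifying every pair of links directly instead of evaluating the explicit double sums, updates are done in parallel from a uniform start (the text uses random initial marginals and random sequential updates), and the tolerance `tol` is a hypothetical value. All function names are illustrative.

```python
import itertools
import math

def relation(a, b):
    """Classify links a=(i, j), b=(k, l) (i < j, k < l) as parallel 'p',
    series 's' or cross 'x'; return None if they share a site."""
    (i, j), (k, l) = sorted([a, b])
    if len({i, j, k, l}) < 4:
        return None
    if j < k:
        return "s"  # disjoint intervals
    if l < j:
        return "p"  # second link nested inside the first
    return "x"      # interleaved endpoints

def bp_step(mu, lam, M):
    """One parallel update mu(e_l) ~ [sum_q e^{lam_q} w_q(e_l)]^(M-2), M >= 2."""
    log_new = {}
    for a in mu:
        total = 0.0
        for b, mu_b in mu.items():
            q = relation(a, b)
            if q is not None:
                total += math.exp(lam[q]) * mu_b
        log_new[a] = (M - 2) * math.log(total)
    shift = max(log_new.values())  # work in logs to avoid overflow
    norm = sum(math.exp(v - shift) for v in log_new.values())
    return {a: math.exp(v - shift) / norm for a, v in log_new.items()}

def run_bp(M, lam, tol=1e-6, max_iter=500):
    """Iterate bp_step until the largest change in mu falls below tol."""
    links = list(itertools.combinations(range(1, 2 * M + 1), 2))
    mu = {a: 1.0 / len(links) for a in links}
    for _ in range(max_iter):
        new = bp_step(mu, lam, M)
        diff = max(abs(new[a] - mu[a]) for a in links)
        mu = new
        if diff < tol:
            break
    return mu
```

For $\lambda_p = \lambda_s = \lambda_x = 0$ every link has the same number of site-disjoint partners, so the uniform distribution is a fixed point of the update; this gives a quick sanity check of the classification logic.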
1. An alternative representation


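We may as well use the matching variables $c_{ij} \in \{0,1\}$, showing the connectivity of nodes
$i$ and $j$, to rewrite the partition function (A1) as

$$ Z(\lambda_p,\lambda_s,\lambda_x) = \sum_{\mathbf{c}} \prod_i I_i(c_{\partial i}) \prod_{(ij)<(kl)} w_{(ij),(kl)}(c_{ij},c_{kl}). \tag{A17} $$

This representation of the problem is more efficient than the one we used above but the
BP equations are more involved; we have to distinguish between two kinds of BP messages,
$\mu_{(ij)\to i}(c_{ij})$ and $\mu_{(ij)\to(kl)}(c_{ij})$. The former is the probability of $c_{ij}$ in absence of the matching
constraint $I_i(c_{\partial i}) = \delta_{\sum_{j\neq i} c_{ij},\,1}$. The latter is computed in absence of the two-link interaction
$w_{(ij),(kl)}(c_{ij},c_{kl}) = \exp\left( c_{ij}c_{kl}\,[\lambda_p\delta_{q(ij;kl),p} + \lambda_s\delta_{q(ij;kl),s} + \lambda_x\delta_{q(ij;kl),x}] \right)$. Here $c_{\partial i} = \{c_{ij} : j \neq i\}$.

The BP equations governing these cavity marginals are

$$ \mu_{(ij)\to i}(c_{ij}) \propto \left( \sum_{c_{\partial j\setminus i}} I_j(c_{\partial j}) \prod_{k\in\partial j\setminus i} \mu_{(jk)\to j}(c_{jk}) \right) \times \prod_{(kl):\,k,l\neq i,j} \left( \sum_{c_{kl}} w_{(ij),(kl)}(c_{ij},c_{kl})\, \mu_{(kl)\to(ij)}(c_{kl}) \right), \tag{A18} $$

and

$$ \mu_{(ij)\to(kl)}(c_{ij}) \propto \left( \sum_{c_{\partial i\setminus j}} I_i(c_{\partial i}) \prod_{k\in\partial i\setminus j} \mu_{(ik)\to i}(c_{ik}) \right) \left( \sum_{c_{\partial j\setminus i}} I_j(c_{\partial j}) \prod_{k\in\partial j\setminus i} \mu_{(jk)\to j}(c_{jk}) \right) \times \prod_{(k'l')\neq(kl):\,k',l'\neq i,j} \left( \sum_{c_{k'l'}} w_{(ij),(k'l')}(c_{ij},c_{k'l'})\, \mu_{(k'l')\to(ij)}(c_{k'l'}) \right). \tag{A19} $$

Similarly, we can compute the one-link marginals $\mu_{(ij)}(c_{ij})$ and the two-link marginals
$\mu_{(ij),(kl)}(c_{ij},c_{kl})$.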


Appendix B: Details of the minsum equations 

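Consider a system of interacting site variables $\sigma_i \in \{0,1\}$ for $i = 1,\dots,L$, with energy
function $\mathcal{E}(\sigma) = \sum_{i<j} \mathcal{E}_{ij}(\sigma_i,\sigma_j)$, where

$$ \mathcal{E}_{ij}(\sigma_i,\sigma_j) = -[\sigma_i\sigma_j + (1-\sigma_i)(1-\sigma_j)] \ln\alpha_{ij} - [1 - \sigma_i\sigma_j - (1-\sigma_i)(1-\sigma_j)] \ln(1-\alpha_{ij}). \tag{B1} $$

The $\alpha_{ij}$ are here parameters, giving the probability of having a link connecting sites $i,j$.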
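To make the pair energy concrete, here is a minimal sketch that evaluates Eq. (B1) and finds the minimum-energy configurations of a tiny system by exhaustive search. It assumes that the total energy is the sum of the pair terms over $i<j$ and that every $\alpha_{ij}$ lies strictly between 0 and 1; the function names are illustrative.

```python
import itertools
import math

def pair_energy(si, sj, alpha):
    """Eq. (B1): -ln(alpha) if the two sites are in the same state,
    -ln(1 - alpha) otherwise."""
    same = si * sj + (1 - si) * (1 - sj)
    return -same * math.log(alpha) - (1 - same) * math.log(1 - alpha)

def total_energy(sigma, alpha):
    """Sum of pair energies over all pairs i < j (assumed form of the total)."""
    L = len(sigma)
    return sum(pair_energy(sigma[i], sigma[j], alpha[i][j])
               for i in range(L) for j in range(i + 1, L))

def ground_states(alpha):
    """Brute-force minimisation over all 2^L configurations (small L only)."""
    L = len(alpha)
    configs = list(itertools.product((0, 1), repeat=L))
    energies = [total_energy(s, alpha) for s in configs]
    e_min = min(energies)
    return [s for s, e in zip(configs, energies) if abs(e - e_min) < 1e-12]
```

Because the energy depends only on whether two sites agree, it is invariant under flipping all sites at once, so ground states always come in pairs. With all $\alpha_{ij} > 1/2$ the two uniform configurations are the minima.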

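We start from the finite-temperature BP equations for the cavity marginals of the probability
distribution of variable configurations $\mathcal{P}(\sigma) \propto e^{-\beta\mathcal{E}(\sigma)}$,

$$ \mu_{i\to j}(\sigma_i) \propto \prod_{k\neq i,j} \left( \sum_{\sigma_k} e^{-\beta\mathcal{E}_{ik}(\sigma_i,\sigma_k)}\, \mu_{k\to i}(\sigma_k) \right). \tag{B2} $$

This is the probability of state $\sigma_i$ for site $i$ in the absence of interaction with site $j$. It is more
appropriate to work with the cavity fields $h_{i\to j} = \frac{1}{\beta}\ln\left(\frac{\mu_{i\to j}(1)}{\mu_{i\to j}(0)}\right)$, where the BP equations read

$$ h_{i\to j} = \frac{1}{\beta} \sum_{k\neq i,j} \ln\left( \frac{e^{-\beta\mathcal{E}_{ik}(1,0)} + e^{-\beta\mathcal{E}_{ik}(1,1) + \beta h_{k\to i}}}{e^{-\beta\mathcal{E}_{ik}(0,0)} + e^{-\beta\mathcal{E}_{ik}(0,1) + \beta h_{k\to i}}} \right). \tag{B3} $$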


Now take the limit $\beta \to \infty$ of the above equations. The resulting equations for the cavity
messages $h_{i\to j}$ are called minsum equations [15] and read

$$ h_{i\to j} = \sum_{k\neq i,j} \max\left(\ln(1-\alpha_{ik}),\ \ln\alpha_{ik} + h_{k\to i}\right) - \sum_{k\neq i,j} \max\left(\ln\alpha_{ik},\ \ln(1-\alpha_{ik}) + h_{k\to i}\right). \tag{B4} $$

We solve the minsum equations for the cavity messages by iteration, starting from random 
initial messages. In the end, the local messages hi are obtained like the cavity ones but 
considering all the incoming messages from the neighboring variables. 
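One sweep of the minsum update can be written compactly. The sketch below is an illustration under assumptions: messages are stored in a dictionary keyed by ordered pairs `(i, j)` for the message from site $i$ to site $j$, all messages are updated in parallel from the previous sweep (the update schedule is not specified beyond "by iteration"), and every $\alpha_{ik}$ lies strictly between 0 and 1. The names `minsum_term`, `minsum_step` and `local_fields` are hypothetical.

```python
import math

def minsum_term(a, hk):
    """Contribution of one neighbour k with link probability a and
    incoming message hk, as in the two max(...) terms of Eq. (B4)."""
    return (max(math.log(1 - a), math.log(a) + hk)
            - max(math.log(a), math.log(1 - a) + hk))

def minsum_step(h, alpha):
    """Parallel update of all cavity messages h[(i, j)] (message i -> j)."""
    L = len(alpha)
    return {
        (i, j): sum(minsum_term(alpha[i][k], h[(k, i)])
                    for k in range(L) if k != i and k != j)
        for (i, j) in h
    }

def local_fields(h, alpha):
    """Local messages h_i: same sum, but over all neighbours k != i."""
    L = len(alpha)
    return [sum(minsum_term(alpha[i][k], h[(k, i)])
                for k in range(L) if k != i)
            for i in range(L)]
```

Each neighbour contributes the incoming message clipped to the interval $[-|\ln(\alpha_{ik}/(1-\alpha_{ik}))|,\ |\ln(\alpha_{ik}/(1-\alpha_{ik}))|]$, with its sign reversed when $\alpha_{ik} < 1/2$; for instance, with $\alpha_{ik} = 0.9$ a large positive incoming message contributes $\ln 9$.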

To find a configuration minimizing the energy, we use the reinforcement algorithm 1^: In
each step of updating the cavity messages, we add external fields that polarize the messages
in the direction suggested by the local messages. More precisely, the reinforced minsum
equations read

$$ h^{t+1}_{i\to j} = r(t)\,h^t_i + \sum_{k\neq i,j} \max\left(\ln(1-\alpha_{ik}),\ \ln\alpha_{ik} + h^t_{k\to i}\right) - \sum_{k\neq i,j} \max\left(\ln\alpha_{ik},\ \ln(1-\alpha_{ik}) + h^t_{k\to i}\right). \tag{B5} $$

Similarly, we update the local messages

$$ h^{t+1}_i = r(t)\,h^t_i + \sum_{k\neq i} \max\left(\ln(1-\alpha_{ik}),\ \ln\alpha_{ik} + h^t_{k\to i}\right) - \sum_{k\neq i} \max\left(\ln\alpha_{ik},\ \ln(1-\alpha_{ik}) + h^t_{k\to i}\right). \tag{B6} $$

k^i 

The reinforcement parameter $r(t)$ is zero at the beginning of the algorithm ($t = 0$) and
grows slowly with $t$, for example as $r(t+1) = r(t) + \delta$.
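The reinforced iteration can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: updates are done in parallel, the initial cavity messages are passed in explicitly (the text starts from random messages), the number of steps `n_steps` and the increment `delta` are hypothetical values, and the final configuration is decoded as $\sigma_i = 1$ when the local message $h_i$ is positive, which follows from reading $h$ as a log-ratio favouring state 1.

```python
import math

def _term(a, hk):
    """One neighbour's contribution to the minsum sums in (B5) and (B6)."""
    return (max(math.log(1 - a), math.log(a) + hk)
            - max(math.log(a), math.log(1 - a) + hk))

def reinforced_minsum(alpha, h_init, n_steps=30, delta=0.05):
    """Run reinforced minsum and return the decoded configuration.

    alpha  : L x L matrix with entries strictly between 0 and 1
    h_init : dict mapping (i, j) -> initial cavity message from i to j
    """
    L = len(alpha)
    h = dict(h_init)
    h_loc = [0.0] * L  # local messages h_i
    r = 0.0            # reinforcement parameter, r(0) = 0
    for _ in range(n_steps):
        new_h = {
            (i, j): r * h_loc[i] + sum(_term(alpha[i][k], h[(k, i)])
                                       for k in range(L) if k not in (i, j))
            for (i, j) in h
        }
        new_loc = [
            r * h_loc[i] + sum(_term(alpha[i][k], h[(k, i)])
                               for k in range(L) if k != i)
            for i in range(L)
        ]
        h, h_loc = new_h, new_loc
        r += delta  # r(t + 1) = r(t) + delta
    return [1 if x > 0 else 0 for x in h_loc]
```

Because the energy is symmetric under flipping all sites, the sign of the initial messages decides which of the two degenerate minima is selected; a slightly positive start with all $\alpha_{ij} = 0.9$ yields the all-ones configuration.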
Appendix C: More details and figures


Here we present more details of the numerical simulations and figures obtained in this 
study. 

Figure 7 displays the approximate entropy (logarithm of the number of link configurations
$\mathcal{N}$) that is obtained within the Bethe approximation. Here we take $\lambda_x = 0$ and report the
entropy in the space of parameters $(\lambda_p,\lambda_s)$ even if the algorithm does not converge. The
solution to the BP equations (A9) is found by iteration, and the algorithm is considered
converged when the difference in the BP messages $\mu(e_l)$ in two successive steps of the
iteration is less than a convergence limit $\epsilon$ = 10“®. The same figure shows the region of the
parameter space where the BP algorithm converges. Given the BP messages, the entropy is
computed by Eqs. (A3) and (A7)-(A8).

Figures 8, 9, and 10 show the one-link and two-link distributions for more parameter
samples obtained by the BP algorithm as described above. In Fig. 11 we compare the
two-link distance distribution obtained by the approximate algorithm with the exact one for
a small number of links.
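For a small number of links, exact averages can be obtained by enumerating every perfect matching. The sketch below is a simplified illustration: it computes only the total averages $\langle N_p\rangle$, $\langle N_s\rangle$, $\langle N_x\rangle$ under the weight $e^{\lambda_p N_p + \lambda_s N_s + \lambda_x N_x}$, not the distance-resolved quantities shown in the figure, and the function names are hypothetical.

```python
import math

def perfect_matchings(sites):
    """Yield every perfect matching of the sorted list `sites`
    as a list of links (i, j) with i < j."""
    if not sites:
        yield []
        return
    first, rest = sites[0], sites[1:]
    for idx, partner in enumerate(rest):
        remaining = rest[:idx] + rest[idx + 1:]
        for m in perfect_matchings(remaining):
            yield [(first, partner)] + m

def relation(a, b):
    """Two site-disjoint links are in series 's', parallel 'p' or cross 'x'."""
    (i, j), (k, l) = sorted([a, b])
    if j < k:
        return "s"
    if l < j:
        return "p"
    return "x"

def exact_averages(M, lam):
    """Boltzmann-weighted averages of N_p, N_s, N_x by brute force."""
    Z = 0.0
    tot = {"p": 0.0, "s": 0.0, "x": 0.0}
    for m in perfect_matchings(list(range(1, 2 * M + 1))):
        counts = {"p": 0, "s": 0, "x": 0}
        for ia in range(M):
            for ib in range(ia + 1, M):
                counts[relation(m[ia], m[ib])] += 1
        w = math.exp(sum(lam[q] * counts[q] for q in counts))
        Z += w
        for q in counts:
            tot[q] += w * counts[q]
    return {q: tot[q] / Z for q in tot}
```

At $\lambda_p = \lambda_s = \lambda_x = 0$ the three averages coincide and equal $M(M-1)/6$, since any four sites can be paired in exactly one parallel, one series and one cross arrangement. The enumeration grows as $(2M-1)!!$, so it is practical only for small $M$.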

In Fig. 12 we compare the reconstructed one-link and two-link distributions with the
observed data from link configurations with a regular sector of size L/2. The inferred
statistics can be improved by iteration using the information obtained in the previous stages
of the algorithm. In the figure, we also compare the model data obtained without any prior
information (a), and with the information provided in the first stage of the algorithm (b).
FIG. 7. The entropy $S = \ln\mathcal{N}/(M\ln M)$ ($\mathcal{N}$ is the number of configurations) obtained by the Bethe
approximation, and the region where the BP algorithm converges (white region). The data are for
M = 40 links and $\lambda_x = 0$. The convergence limit is $\epsilon$ = 10“®.


[Fig. 8 panel titles: $\lambda_p = 1$, $\lambda_s = \lambda_x = 0$; $\lambda_s = 0$, $\lambda_p = \lambda_x = 1$]
FIG. 8. One-link distribution (more precisely $M(2M-1)\mu(e,r)$) obtained by the Bethe
approximation for different energy parameters $\lambda_{p,s,x}$ with M = 50 links. Here $\mu(e,r)$ is the
probability of having a link with the first endpoint $e$ and length $r$.


[Fig. 9 panel title: $\lambda_x = 0$, $\lambda_p = \lambda_s = 1$; columns p, s, x]



FIG. 9. Two-link length distribution $\mu_{ll'}(r,r')$ (multiplied by a constant to make it of order one)
obtained by the Bethe approximation for M = 50 links. Here $\mu_{ll',q}(r,r')$ is the probability of finding
two links of type $q = p,s,x$ with lengths $r$ and $r'$. The energy parameters $\lambda_{p,s,x}$ and the average
two-link densities are fixed in each row. The columns are for different types of two-links: p (left),
s (center), and x (right).
FIG. 10. Two-link distance distribution $\mu_{ll'}(d)$ (multiplied by a constant to make it of order one)
for different energy parameters $\lambda_{p,s,x}$ and different types of two-links $(p,s,x)$, obtained by the
Bethe approximation for M = 50 links. Here $\mu_{ll',q}(d)$ is the probability of finding two links of type
$q = p,s,x$ at distance $d$ (separation of the first endpoints) from each other.
[Fig. 11 panel titles: Exact, M = 9: $\lambda_p = \lambda_s = \lambda_x = 0$; Exact, M = 9: $\lambda_p = 1$, $\lambda_s = \lambda_x = 0$]
FIG. 11. Comparing the average two-link numbers $\langle N_{ll'}(d)\rangle$ obtained exactly (top) with those of
the Bethe approximation (bottom) for M = 9 links. Distance $d$ of two links is the separation of
their first endpoints. Each panel shows $\langle N_{ll'}(d)\rangle$ for fixed energy parameters $\lambda_{p,s,x}$ but different
types p, s, x.
FIG. 12. The one- and two-link probability distributions obtained by the inverse
algorithm given the average numbers $M^*(r)$, $N^*_{ll'}(d)$ extracted from 10000 randomly generated
configurations of M = 20 links in presence of a regular sector of size L/2 in the center of the chain.
Here $r$ refers to the length of a link, and $d$ gives the distance between the first endpoints of two
links. $M^*(r)$ and $N^*_{ll'}(d)$ are the average number of links of length $r$ and two-links of distance
$d$, respectively. Model (a) is obtained by running the inverse algorithm with no prior information
of the sector. To obtain model (b), we run the inverse algorithm with an additional external field
disfavoring some connections according to the one-link probability distribution $\mu_l(e,r)$ provided
by model (a). $\mu_l(e,r)$ is the probability of having a link with first endpoint $e$ and length $r$.